Many past studies have examined the factors affecting life expectancy using demographic variables, income composition and mortality rates. However, the effects of immunization and the human development index were not taken into account. In addition, some of the past research used multiple linear regression on a data set covering only one year for all countries. This motivates addressing both gaps by formulating regression models, based on a mixed effects model and multiple linear regression, using data from 2000 to 2015 for all countries. Important immunizations such as Hepatitis B, Polio and Diphtheria will also be considered. In a nutshell, this study focuses on immunization factors, mortality factors, economic factors, social factors and other health-related factors. Since the observations in this dataset are organized by country, it will be easier for a country to identify the predictors that contribute to a lower life expectancy. This will help suggest which areas a country should prioritize in order to efficiently improve the life expectancy of its population.
Display the ability to build regression models using the skills and discussions from Units 1 and 2 with the purpose of identifying key relationships, interpreting those relationships, and making good predictions.
Reminder: the key here is to tell a good story.
Perform regression analysis
Hypothesis Testing
Interpret the coefficients
Confidence intervals
Practical and statistical significance
- Produce the best predictions possible
- Interpretation is no longer required, hence complexity is no longer an issue
Feature selection to avoid overfitting
Create the model
Compare model 1 vs. model 2
Comment on the differences of the models and whether model 2 brings any benefit
- Nonparametric technique
- kNN or regression trees (select one)
Set of predictors from previous regression: (fill this out)
Model
A brief description of your nonparametric model’s strategy to make a prediction. Include Pros and Cons.
Provide any additional details that you feel might be necessary to report.
Report the test ASE using this nonparametric model so we can see how well it does compared to regression.
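The nonparametric objective above (kNN, then a test ASE to compare against regression) can be sketched in base R as follows. This is only an illustration under stated assumptions: the data are simulated, the helper names `knn_predict` and `ase` are hypothetical, and nothing here reproduces the project's actual model or results.

```r
# Minimal kNN regression sketch on simulated data (not the project data).
set.seed(1)
n <- 200
x <- matrix(runif(n * 2), ncol = 2)            # two hypothetical predictors
y <- 3 * x[, 1] - 2 * x[, 2] + rnorm(n, sd = 0.1)

# Train/test split
train_idx <- sample(n, size = 150)
x_train <- x[train_idx, , drop = FALSE]
y_train <- y[train_idx]
x_test  <- x[-train_idx, , drop = FALSE]
y_test  <- y[-train_idx]

# Standardize with training means/SDs so no predictor dominates the distance
ctr <- colMeans(x_train)
scl <- apply(x_train, 2, sd)
x_train_s <- scale(x_train, center = ctr, scale = scl)
x_test_s  <- scale(x_test,  center = ctr, scale = scl)

# Predict each new point as the mean response of its k nearest training points
knn_predict <- function(x_tr, y_tr, x_new, k = 5) {
  apply(x_new, 1, function(p) {
    d <- sqrt(rowSums(sweep(x_tr, 2, p)^2))    # Euclidean distances to p
    mean(y_tr[order(d)[seq_len(k)]])
  })
}

# Average squared error
ase <- function(obs, pred) mean((obs - pred)^2)

pred_knn      <- knn_predict(x_train_s, y_train, x_test_s, k = 5)
test_ase_knn  <- ase(y_test, pred_knn)

# Baseline for comparison: predict the training mean everywhere
test_ase_mean <- ase(y_test, mean(y_train))
```

The same `ase` helper could be applied to regression predictions on the same test rows, so the two approaches are compared on equal footing. The choice of `k = 5` is arbitrary here; in practice it would be tuned, for example by cross-validation.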
## Life.expectancy.1 count mean sd
## Length:2 Min. : 41.00 Min. :5.454 Min. :2.422
## Class :character 1st Qu.: 66.25 1st Qu.:5.693 1st Qu.:2.516
## Mode :character Median : 91.50 Median :5.933 Median :2.610
## Mean : 91.50 Mean :5.933 Mean :2.610
## 3rd Qu.:116.75 3rd Qu.:6.173 3rd Qu.:2.705
## Max. :142.00 Max. :6.413 Max. :2.799
## Life.expectancy.1 count mean sd
## Length:2 Min. : 41.00 Min. : 95.05 Min. : 169.8
## Class :character 1st Qu.: 66.25 1st Qu.: 387.22 1st Qu.: 838.8
## Mode :character Median : 91.50 Median : 679.40 Median :1507.8
## Mean : 91.50 Mean : 679.40 Mean :1507.8
## 3rd Qu.:116.75 3rd Qu.: 971.58 3rd Qu.:2176.8
## Max. :142.00 Max. :1263.75 Max. :2845.8
##
## F test to compare two variances
##
## data: Total.expenditure by Life.expectancy.1
## F = 1.3359, num df = 140, denom df = 39, p-value = 0.2947
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.7751741 2.1398108
## sample estimates:
## ratio of variances
## 1.335916
##
## F test to compare two variances
##
## data: percentage.expenditure by Life.expectancy.1
## F = 280.92, num df = 141, denom df = 40, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 164.1030 448.0117
## sample estimates:
## ratio of variances
## 280.92
##
## Two Sample t-test
##
## data: Total.expenditure by Life.expectancy.1
## t = 1.9683, df = 179, p-value = 0.05058
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.002459138 1.921558429
## sample estimates:
## mean in group High mean in group Low
## 6.41305 5.45350
##
## Welch's Heteroscedastic F Test (alpha = 0.05)
## -------------------------------------------------------------
## data : percentage.expenditure and Life.expectancy.1
##
## statistic : 17.24616
## num df : 1
## denom df : 100.4256
## p.value : 6.901585e-05
##
## Result : Difference is statistically significant.
## -------------------------------------------------------------
## Life.expectancy.1 count mean sd
## Length:2 Min. :33.00 Min. :19.60 Min. :28.41
## Class :character 1st Qu.:38.25 1st Qu.:23.38 1st Qu.:29.28
## Mode :character Median :43.50 Median :27.16 Median :30.14
## Mean :43.50 Mean :27.16 Mean :30.14
## 3rd Qu.:48.75 3rd Qu.:30.94 3rd Qu.:31.01
## Max. :54.00 Max. :34.72 Max. :31.88
##
## F test to compare two variances
##
## data: Total.expenditure by Life.expectancy.1
## F = 1.6541, num df = 52, denom df = 31, p-value = 0.1357
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.8508704 3.0524391
## sample estimates:
## ratio of variances
## 1.654055
##
## Two Sample t-test
##
## data: percentage.expenditure by Life.expectancy.1
## t = -2.2986, df = 85, p-value = 0.02398
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -28.191389 -2.040987
## sample estimates:
## mean in group High mean in group Low
## 19.60255 34.71874
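The outputs above pair a variance check with a comparison of group means. A minimal sketch of that workflow in base R is given below, assuming simulated data and hypothetical column names `group` and `value`. Where the variances differ it falls back to the Welch two-sample t-test from `t.test`, which is not the same routine as the Welch heteroscedastic F test shown in the output.

```r
# Simulated two-group data (not the project data)
set.seed(42)
dat <- data.frame(
  group = factor(rep(c("High", "Low"), times = c(140, 40))),
  value = c(rnorm(140, mean = 6.4, sd = 2.8),
            rnorm(40,  mean = 5.5, sd = 2.4))
)

# Step 1: F test for equality of the two group variances
ftest <- var.test(value ~ group, data = dat)

# Step 2: pooled t-test if equal variances is not rejected, otherwise Welch
equal_var <- ftest$p.value > 0.05
ttest <- t.test(value ~ group, data = dat, var.equal = equal_var)

ftest$p.value    # p-value of the variance test
ttest$method     # which t-test was run
ttest$conf.int   # 95% CI for the difference in means (High minus Low)
```

Note that `t.test` uses `var.equal = FALSE` by default, so the Welch version is what runs unless the pooled test is requested explicitly.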
##
## Call:
## lm(formula = Life.expectancy ~ Adult.Mortality, data = Life_Expectancy_Df_2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.9654 -2.5457 0.8639 3.2843 13.1335
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80.64428 0.71342 113.04 <2e-16 ***
## Adult.Mortality -0.06125 0.00391 -15.66 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.593 on 181 degrees of freedom
## Multiple R-squared: 0.5755, Adjusted R-squared: 0.5732
## F-statistic: 245.4 on 1 and 181 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = Life.expectancy ~ BMI, data = Life_Expectancy_Df_2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.2151 -4.5711 0.3012 4.2668 23.9674
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.69240 1.21867 52.26 < 2e-16 ***
## BMI 0.19423 0.02643 7.35 6.81e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.484 on 179 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.2318, Adjusted R-squared: 0.2275
## F-statistic: 54.02 on 1 and 179 DF, p-value: 6.809e-12
##
## Call:
## lm(formula = Life.expectancy ~ Alcohol, data = Life_Expectancy_Df_2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.3479 -4.4217 0.6886 5.5232 15.3815
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68.1077 0.6866 99.192 < 2e-16 ***
## Alcohol 1.0732 0.1301 8.252 3.21e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.27 on 180 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.2745, Adjusted R-squared: 0.2704
## F-statistic: 68.1 on 1 and 180 DF, p-value: 3.214e-14
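To connect these single-predictor fits to the confidence-interval step in the outline, the sketch below builds a 95% interval for the Adult.Mortality slope from the estimate, standard error and residual degrees of freedom printed above. With the fitted `lm` object available, `confint()` would give the same interval directly; the values are retyped here because only the printed summary is shown, so the result is approximate to the printed rounding.

```r
# Values copied from the Adult.Mortality model summary above
est <- -0.06125   # slope estimate
se  <- 0.00391    # standard error of the slope
df  <- 181        # residual degrees of freedom

t_crit <- qt(0.975, df)              # two-sided 95% critical value
ci <- est + c(-1, 1) * t_crit * se   # lower and upper bounds

# Rescale to a 10-unit increase in Adult.Mortality for easier reading
ci_per_10 <- 10 * ci

round(ci, 4)
round(ci_per_10, 2)
```

Because the whole interval lies below zero, higher adult mortality is associated with lower life expectancy in this simple model, in line with the reported p-value.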
Linear correlations:
- Schooling vs Income.composition.of.resources
- thinness..1.19.years vs thinness.5.9.years
- Life.expectancy vs Schooling
- Life.expectancy vs income
- infant.deaths vs under.five.deaths

Variables that were correlated with one another were removed.
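One way to screen for correlated predictors like those listed above is to compute a correlation matrix and flag pairs above a cutoff. The sketch below is illustrative only: the data are simulated, the column `Income.comp` is a hypothetical stand-in, and the 0.8 cutoff is an assumption rather than a value taken from this report.

```r
# Simulated predictors (not the project data); one pair is built to be collinear
set.seed(7)
n <- 150
schooling <- rnorm(n, mean = 13, sd = 3)
sim <- data.frame(
  Schooling   = schooling,
  Income.comp = 0.05 * schooling + rnorm(n, sd = 0.03),
  Alcohol     = rnorm(n, mean = 3, sd = 2),
  BMI         = rnorm(n, mean = 40, sd = 15)
)

# Pairwise deletion tolerates missing values in individual columns
cor_mat <- cor(sim, use = "pairwise.complete.obs")

# Keep each pair once (upper triangle) and flag |r| above the chosen cutoff
cutoff <- 0.8   # assumed threshold, not taken from the report
idx <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
high_pairs
```

For each flagged pair, one variable would typically be dropped before model fitting; which one to keep is a subject-matter decision.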
## Country Year Status Life.expectancy
## Length:183 Min. :2014 Length:183 Min. :48.10
## Class :character 1st Qu.:2014 Class :character 1st Qu.:65.60
## Mode :character Median :2014 Mode :character Median :73.60
## Mean :2014 Mean :71.54
## 3rd Qu.:2014 3rd Qu.:76.85
## Max. :2014 Max. :89.00
##
## Adult.Mortality infant.deaths Alcohol percentage.expenditure
## Min. : 1.0 Min. : 0.00 Min. : 0.010 Min. : 0.00
## 1st Qu.: 66.0 1st Qu.: 0.00 1st Qu.: 0.010 1st Qu.: 11.06
## Median :135.0 Median : 2.00 Median : 0.320 Median : 151.10
## Mean :148.7 Mean : 24.56 Mean : 3.271 Mean : 1001.91
## 3rd Qu.:216.5 3rd Qu.: 18.00 3rd Qu.: 6.700 3rd Qu.: 703.21
## Max. :522.0 Max. :957.00 Max. :15.190 Max. :19479.91
## NA's :1
## Hepatitis.B Measles BMI under.five.deaths
## Min. : 2.00 Min. : 0 Min. : 2.00 Min. : 0.00
## 1st Qu.:79.00 1st Qu.: 0 1st Qu.:23.20 1st Qu.: 0.00
## Median :93.00 Median : 13 Median :47.40 Median : 3.00
## Mean :83.12 Mean : 1831 Mean :41.03 Mean : 32.89
## 3rd Qu.:97.00 3rd Qu.: 316 3rd Qu.:59.80 3rd Qu.: 22.00
## Max. :99.00 Max. :79563 Max. :77.10 Max. :1200.00
## NA's :10 NA's :2
## Polio Total.expenditure Diphtheria HIV.AIDS
## Min. : 8.00 Min. : 1.210 Min. : 2.00 Min. :0.100
## 1st Qu.:80.00 1st Qu.: 4.480 1st Qu.:83.00 1st Qu.:0.100
## Median :94.00 Median : 5.840 Median :94.00 Median :0.100
## Mean :84.73 Mean : 6.201 Mean :84.08 Mean :0.682
## 3rd Qu.:97.00 3rd Qu.: 7.740 3rd Qu.:97.00 3rd Qu.:0.400
## Max. :99.00 Max. :17.140 Max. :99.00 Max. :9.400
## NA's :2
## GDP Population thinness..1.19.years
## Min. : 12.28 Min. :4.100e+01 Min. : 0.100
## 1st Qu.: 617.99 1st Qu.:2.869e+05 1st Qu.: 1.500
## Median : 3154.51 Median :1.568e+06 Median : 3.300
## Mean : 10015.57 Mean :2.106e+07 Mean : 4.533
## 3rd Qu.: 8239.95 3rd Qu.:8.080e+06 3rd Qu.: 6.600
## Max. :119172.74 Max. :1.294e+09 Max. :26.800
## NA's :28 NA's :41 NA's :2
## thinness.5.9.years Income.composition.of.resources Schooling
## Min. : 0.100 Min. :0.3450 Min. : 4.90
## 1st Qu.: 1.500 1st Qu.:0.5700 1st Qu.:10.80
## Median : 3.400 Median :0.7220 Median :13.00
## Mean : 4.676 Mean :0.6884 Mean :12.89
## 3rd Qu.: 6.600 3rd Qu.:0.7960 3rd Qu.:14.90
## Max. :27.400 Max. :0.9450 Max. :20.40
## NA's :2 NA's :10 NA's :10
## Life.expectancy.1
## Length:183
## Class :character
## Mode :character
##
##
##
##
## [1] "Country" "Year"
## [3] "Status" "Life.expectancy"
## [5] "Adult.Mortality" "infant.deaths"
## [7] "Alcohol" "percentage.expenditure"
## [9] "Hepatitis.B" "Measles"
## [11] "BMI" "under.five.deaths"
## [13] "Polio" "Total.expenditure"
## [15] "Diphtheria" "HIV.AIDS"
## [17] "GDP" "Population"
## [19] "thinness..1.19.years" "thinness.5.9.years"
## [21] "Income.composition.of.resources" "Schooling"
## [23] "Life.expectancy.1"
Forward, backward, and stepwise regressions were run, and all three resulted in the same four significant variables:
- Adult.Mortality
- Total.expenditure
- HIV.AIDS
- Income.composition.of.resources
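The three selection procedures can be sketched with base R's `step()`, which adds or drops terms by AIC. This is a hedged illustration on simulated data with hypothetical predictors `x1` to `x4`; the report does not state which function or selection criterion was actually used.

```r
# Simulated data (not the project data): x3 and x4 are pure noise
set.seed(3)
n <- 120
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
sim$y <- 70 - 2 * sim$x1 + 1.5 * sim$x2 + rnorm(n)

null_fit <- lm(y ~ 1, data = sim)                   # intercept only
full_fit <- lm(y ~ x1 + x2 + x3 + x4, data = sim)   # all candidates
scope    <- list(lower = formula(null_fit), upper = formula(full_fit))

fwd  <- step(null_fit, scope = scope, direction = "forward",  trace = 0)
bwd  <- step(full_fit, scope = scope, direction = "backward", trace = 0)
both <- step(null_fit, scope = scope, direction = "both",     trace = 0)

# Compare the terms retained by each procedure
sapply(list(forward = fwd, backward = bwd, stepwise = both),
       function(m) paste(sort(attr(terms(m), "term.labels")), collapse = " + "))
```

Since `step()` selects by AIC rather than by p-values, a term it keeps is not guaranteed to be significant at the 0.05 level, so the retained terms are worth checking against `summary()` of the final fit.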
##
## Call:
## lm(formula = Life.expectancy ~ Adult.Mortality + Total.expenditure +
## HIV.AIDS + Income.composition.of.resources + Life.expectancy.1,
## data = df1_complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.7539 -1.5929 0.0074 1.6631 10.6939
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50.76505 2.32027 21.879 < 2e-16 ***
## Adult.Mortality -0.01700 0.00378 -4.496 1.56e-05 ***
## Total.expenditure 0.39230 0.10971 3.576 0.000497 ***
## HIV.AIDS -0.55729 0.24900 -2.238 0.026985 *
## Income.composition.of.resources 31.75821 2.96054 10.727 < 2e-16 ***
## Life.expectancy.1Low -2.90429 1.08518 -2.676 0.008441 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.029 on 125 degrees of freedom
## Multiple R-squared: 0.8809, Adjusted R-squared: 0.8761
## F-statistic: 184.9 on 5 and 125 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = Life.expectancy ~ Adult.Mortality + infant.deaths +
## Alcohol + percentage.expenditure + Hepatitis.B + Measles +
## BMI + under.five.deaths + Polio + Total.expenditure + Diphtheria +
## HIV.AIDS + GDP + Population + thinness..1.19.years + thinness.5.9.years +
## Income.composition.of.resources + Schooling + Life.expectancy.1,
## data = df1_complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.6206 -1.7993 0.1196 1.4778 9.0945
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.416e+01 3.441e+00 15.739 < 2e-16 ***
## Adult.Mortality -1.707e-02 4.062e-03 -4.201 5.39e-05 ***
## infant.deaths 5.084e-02 5.664e-02 0.897 0.371396
## Alcohol 5.116e-02 9.458e-02 0.541 0.589625
## percentage.expenditure 3.714e-04 4.545e-04 0.817 0.415602
## Hepatitis.B -8.781e-03 2.818e-02 -0.312 0.755941
## Measles -2.284e-05 4.748e-05 -0.481 0.631396
## BMI -7.922e-03 1.954e-02 -0.406 0.685872
## under.five.deaths -3.548e-02 3.892e-02 -0.912 0.363915
## Polio -7.978e-03 2.073e-02 -0.385 0.701115
## Total.expenditure 3.129e-01 1.243e-01 2.518 0.013221 *
## Diphtheria 2.522e-02 3.397e-02 0.742 0.459403
## HIV.AIDS -5.789e-01 2.628e-01 -2.202 0.029702 *
## GDP -2.934e-05 6.516e-05 -0.450 0.653446
## Population -2.321e-09 6.646e-09 -0.349 0.727600
## thinness..1.19.years 1.896e-02 2.296e-01 0.083 0.934327
## thinness.5.9.years -1.528e-01 2.261e-01 -0.676 0.500610
## Income.composition.of.resources 2.717e+01 7.106e+00 3.824 0.000217 ***
## Schooling 2.068e-02 2.732e-01 0.076 0.939815
## Life.expectancy.1Low -3.215e+00 1.303e+00 -2.468 0.015113 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.12 on 111 degrees of freedom
## Multiple R-squared: 0.8877, Adjusted R-squared: 0.8685
## F-statistic: 46.2 on 19 and 111 DF, p-value: < 2.2e-16
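Because the five-term model is nested in this full model and both are fit to `df1_complete`, the two can be compared with a partial F test. The sketch below reconstructs that test from the residual standard errors and degrees of freedom printed in the two summaries; with the fitted objects available, `anova()` on the two models would do the same thing. The inputs are rounded printed values, so the result is approximate.

```r
# Values copied from the two model summaries above
rse_small <- 3.029; df_small <- 125   # five-term model
rse_full  <- 3.12;  df_full  <- 111   # nineteen-term model

# Residual sum of squares = (residual standard error)^2 * residual df
rss_small <- rse_small^2 * df_small
rss_full  <- rse_full^2  * df_full

# Partial F statistic for the extra terms in the full model
extra_df <- df_small - df_full
f_stat   <- ((rss_small - rss_full) / extra_df) / (rss_full / df_full)
p_val    <- pf(f_stat, extra_df, df_full, lower.tail = FALSE)

c(F = f_stat, df1 = extra_df, df2 = df_full, p = p_val)
```

An F statistic below 1 indicates that the 14 additional predictors reduce the residual sum of squares by less than would be expected from noise alone, which agrees with the lower adjusted R-squared of the full model (0.8685 versus 0.8761).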